High-Dimensional Graphical Model Selection Using
$\ell_1$-Regularized Logistic Regression

Pradeep Ravikumar†  Martin J. Wainwright*,†

pradeepr@stat.berkeley.edu  wainwrig@stat.berkeley.edu

Department of Statistics† and Department of EECS*
UC Berkeley, Berkeley, CA 94720

John D. Lafferty

lafferty@cs.cmu.edu

Departments of Computer Science and Machine Learning
Carnegie Mellon University
Pittsburgh, PA 15213

Technical Report, Department of Statistics, UC Berkeley
April 19, 2008

Abstract 

We consider the problem of estimating the graph structure associated with a discrete
Markov random field. We describe a method based on $\ell_1$-regularized logistic regression,
in which the neighborhood of any given node is estimated by performing logistic
regression subject to an $\ell_1$-constraint. Our framework applies to the high-dimensional
setting, in which both the number of nodes $p$ and maximum neighborhood size $d$ are allowed to
grow as a function of the number of observations $n$. Our main results provide sufficient
conditions on the triple $(n, p, d)$ for the method to succeed in consistently estimating the
neighborhood of every node in the graph simultaneously. Under certain assumptions
on the population Fisher information matrix, we prove that consistent neighborhood
selection can be obtained for sample sizes $n = \Omega(d^3 \log p)$, with the error decaying as
$O(\exp(-Cn/d^3))$ for some constant $C$. If these same assumptions are imposed directly
on the sample matrices, we show that $n = \Omega(d^2 \log p)$ samples are sufficient.

1 Introduction

Undirected graphical models, also known as Markov random fields (MRFs), are used in a
variety of domains, including artificial intelligence, natural language processing, image
analysis, statistical physics, and spatial statistics. A Markov random field (MRF)
is specified by an undirected graph $G = (V, E)$, with vertex set $V = \{1, 2, \ldots, p\}$ and edge set
$E \subset V \times V$. The structure of this graph encodes certain conditional independence
assumptions among subsets of the $p$-dimensional discrete random variable $X = (X_1, X_2, \ldots, X_p)$,
where variable $X_i$ is associated with vertex $i \in V$. A fundamental problem is the graphical
model selection problem: given a set of $n$ samples $\{x^{(1)}, x^{(2)}, \ldots, x^{(n)}\}$ from a Markov
random field, estimate the structure of the underlying graph. The sample complexity of such
an estimator is the minimal number of samples $n$, as a function of the graph size $p$ and
possibly other parameters such as the maximum node degree $d$, required for the probability
of correct identification of the graph to converge to one. Another important property of
any model selection procedure is its computational complexity.

Due to both its importance and difficulty, structure learning in random fields has
attracted considerable attention. The absence of an edge in a graphical model encodes a
conditional independence assumption. Constraint-based approaches (Spirtes et al., 2000)
estimate these conditional independencies from the data using hypothesis testing, and then
determine a graph that most closely represents those independencies. Each graph represents
a model class of graphical models; learning a graph is then a model class selection problem.
Score-based approaches combine a metric for the complexity of the graph with a goodness
of fit measure of the graph to the data (for instance, the log-likelihood of the maximum
likelihood parameters given the graph) to obtain a score for each graph. The score is then used
together with a search procedure that generates candidate graph structures to be scored.
The number of graph structures grows super-exponentially, however, and Chickering (1995)
shows that this problem is in general NP-hard.

A complication for undirected graphical models is that typical score metrics involve
the normalization constant (also called the partition function) associated with the Markov
random field, which is intractable (#P-hard) to compute for general undirected models. The
space of candidate structures in score-based approaches is thus typically restricted to
either directed models (Bayesian networks) or simple undirected graph classes such as
trees (Chow and Liu, 1968), polytrees (Dasgupta, 1999) and hypertrees (Srebro, 2003).
Abbeel et al. (2006) propose a method for learning factor graphs based on local conditional
entropies and thresholding, and analyze its behavior in terms of Kullback-Leibler divergence
between the fitted and true models. They obtain a sample complexity that grows
logarithmically in the graph size $p$, but the computational complexity grows at least as quickly as
$O(p^{d+1})$, where $d$ is the maximum neighborhood size in the graphical model. This order of
complexity arises from the fact that for each node, there are $\binom{p}{d} = O(p^d)$ possible
neighborhoods of size $d$ for a graph with $p$ vertices. Csiszár and Talata (2006) show consistency
of a method that uses pseudo-likelihood and a modification of the BIC criterion, but this
also involves a prohibitively expensive search.



In work subsequent to the initial conference version of this work (Wainwright et al.,
2007), other researchers have also studied the problem of model selection in discrete Markov
random fields. For the special case of bounded degree models, Bresler et al. (2008) describe a
simple search-based method, and prove under relatively mild assumptions that it can recover
the graph structure with $\Theta(d \log p)$ samples. However, in the absence of additional
restrictions, the computational complexity of the method is $O(p^{d+1})$. Santhanam and Wainwright
(2008) analyze the information-theoretic limits of graphical model selection, providing both
upper and lower bounds on various model selection procedures, but these methods also have
prohibitive computational cost.

The main contribution of this paper is to present and analyze the computational and
sample complexity of a simple method for graphical model selection. Our analysis is
high-dimensional in nature, meaning that both the model dimension $p$ and the maximum
neighborhood size $d$ may tend to infinity as a function of the sample size $n$. Our main result shows
that under mild assumptions, consistent neighborhood selection is possible with sample
complexity $n = \Omega(d^3 \log p)$ and computational complexity $O(\max\{n, p\}\, p^3)$, when applied
to any graph with $p$ vertices and maximum degree $d$. The basic approach is straightforward:
it involves performing $\ell_1$-regularized logistic regression of each variable on the remaining
variables, and then using the sparsity pattern of the regression vector to infer the underlying
neighborhood structure.

The technique of $\ell_1$-regularization for estimation of sparse models or signals has a long
history in many fields; for instance, see Tropp (2006) for one survey. A surge of recent work
has shown that $\ell_1$-regularization can lead to practical algorithms with strong theoretical
guarantees (e.g., Candès and Tao (2006), Donoho and Elad (2003), Meinshausen and Bühlmann
(2006), Ng (2004), Tropp (2006), Wainwright (2006), Zhao and Yu (2007)). Despite the
well-known computational intractability of discrete MRFs, our method is computationally
efficient; it involves neither computing the normalization constant (or partition
function) associated with the Markov random field, nor a combinatorial search through the
space of graph structures. Rather, it requires only the solution of standard convex
programs, with an overall computational complexity of order $O(\max\{p, n\}\, p^3)$ (Koh et al.,
2007), and is thus well-suited to high-dimensional problems. Conceptually, like the work
of Meinshausen and Bühlmann (2006) on covariance selection in Gaussian graphical models,
our approach can be understood as using a type of pseudo-likelihood, based on the
local conditional likelihood at each node. In contrast to the Gaussian case, where the
maximum likelihood estimate can be computed exactly in polynomial time, this use of a
surrogate loss function is essential for discrete Markov random fields, given the intractability
of computing the exact likelihood.

The remainder of this paper is organized as follows. We begin in Section 2 with
background on discrete graphical models, the model selection problem, and logistic regression.
In Section 3, we state our main result, develop some of its consequences, and provide a
high-level outline of the proof. Section 4 is devoted to proving a result under stronger
assumptions on the sample Fisher information matrix itself, whereas Section 5 provides
concentration results linking the population matrices to the sample versions. In Section 6,
we provide some experimental results to illustrate the practical performance of our method,
and the close agreement between theory and practice, and we conclude in Section 7.

Notation: For the convenience of the reader, we summarize here notation to be used
throughout the paper. We use the following standard notation for asymptotics: we write
$f(n) = O(g(n))$ if $f(n) \le K g(n)$ for some constant $K < \infty$, and $f(n) = \Omega(g(n))$ if
$f(n) \ge K' g(n)$ for some constant $K' > 0$. The notation $f(n) = \Theta(g(n))$ means that
$f(n) = O(g(n))$ and $f(n) = \Omega(g(n))$. Given a vector $x \in \mathbb{R}^d$ and parameter $q \in [1, \infty]$, we
use $\|x\|_q$ to denote the usual $\ell_q$ norm. Given a matrix $X \in \mathbb{R}^{a \times b}$ and parameter $q \in [1, \infty]$,
we use $|||X|||_q$ to denote the induced matrix-operator norm (viewed as a mapping from
$\ell_q^b \to \ell_q^a$); see Horn and Johnson (1985). Two examples of particular importance in this
paper are the spectral norm $|||X|||_2$, corresponding to the maximal singular value of $X$, and
the $\ell_\infty$ matrix norm, given by $|||X|||_\infty = \max_{j=1,\ldots,a} \sum_{k=1}^{b} |X_{jk}|$. We make use of the bound
$|||X|||_\infty \le \sqrt{a}\, |||X|||_2$ for any square matrix $X \in \mathbb{R}^{a \times a}$.



2 Background and problem formulation 

We begin by providing some background on Markov random fields, defining the problem 
of graphical model selection, and describing our method based on neighborhood logistic 
regression. 

2.1 Markov random fields 

Given an undirected graph $G$ with vertex set $V = \{1, \ldots, p\}$ and edge set $E$, a Markov
random field (MRF) consists of a random vector $X = (X_1, X_2, \ldots, X_p)$, where the random
variable $X_s$ is associated with vertex $s \in V$. The random vector $X \in \mathcal{X}^p$ is said to
be pairwise Markov with respect to the graph if its probability distribution factorizes as
$P(x) \propto \exp\{\sum_{(s,t) \in E} \phi_{st}(x_s, x_t)\}$, where each $\phi_{st}$ is a mapping from pairs $(x_s, x_t) \in \mathcal{X}_s \times \mathcal{X}_t$
to the real line. An important special case is the Ising model, in which $X_s \in \{-1, 1\}$ for each
vertex $s \in V$, and $\phi_{st}(x_s, x_t) = \theta^*_{st} x_s x_t$ for some parameter $\theta^*_{st} \in \mathbb{R}$, so that the distribution
takes the form

$$P_{\theta^*}(x) = \frac{1}{Z(\theta^*)} \exp\Big\{ \sum_{(s,t) \in E} \theta^*_{st} x_s x_t \Big\}. \qquad (1)$$

The partition function $Z(\theta^*)$ ensures that the distribution sums to one. The Ising model
has proven useful in many domains, including statistical physics, where it describes the
behavior of gases or magnets, computer vision, where it is used for image segmentation, and
social network analysis.
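
To make the role of the partition function concrete, the following minimal sketch evaluates the Ising distribution (1) by brute-force enumeration. The function name `ising_distribution`, the 3-node chain, and its edge weights are illustrative assumptions, not part of the original text; the enumeration costs $O(2^p)$ and is only feasible for very small graphs.

```python
import itertools
import math

def ising_distribution(p, theta):
    """Brute-force Ising model on p spins; theta maps edges (s, t) to weights.

    Returns the partition function Z and a dict from configurations to
    probabilities. The cost is O(2^p), so this is only for tiny graphs.
    """
    configs = list(itertools.product([-1, 1], repeat=p))
    weights = [
        math.exp(sum(w * x[s] * x[t] for (s, t), w in theta.items()))
        for x in configs
    ]
    Z = sum(weights)
    return Z, {x: wt / Z for x, wt in zip(configs, weights)}

# Hypothetical 3-node chain 0 - 1 - 2 with illustrative edge weights.
theta_star = {(0, 1): 0.5, (1, 2): -0.3}
Z, probs = ising_distribution(3, theta_star)
total = sum(probs.values())  # should equal one up to rounding
```

For a tree-structured graph without node-wise terms, such as this chain, the partition function has the closed form $2^p \prod_{(s,t) \in E} \cosh(\theta_{st})$, which gives a quick sanity check on the enumeration.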

2.2 Graphical model selection 

Suppose that we are given a collection $\{x^{(i)}\} = \{x^{(1)}, \ldots, x^{(n)}\}$ of $n$ samples, where each
$p$-dimensional vector $x^{(i)}$ is drawn in an i.i.d. manner from a distribution $P_{\theta^*}$ of the form (1).
It is convenient to view the parameter vector $\theta^*$ as a $\binom{p}{2}$-dimensional vector, indexed by
pairs of distinct vertices, with entry $\theta^*_{st}$ non-zero if and only if the vertex pair $(s, t)$ belongs to the
unknown edge set $E$ of the underlying graph $G$. The goal of graphical model selection is
to infer the edge set $E$ of the graphical model defining the probability distribution that
generates the samples. In this paper, we study the slightly stronger criterion of signed edge
recovery: in particular, given a graphical model with parameter $\theta^*$, we define the edge sign
vector

$$E^*_{st} := \begin{cases} \mathrm{sign}(\theta^*_{st}) & \text{if } (s,t) \in E \\ 0 & \text{otherwise.} \end{cases} \qquad (2)$$

Note that the weaker graphical model selection problem amounts to recovering the vector
$|E^*|$ of absolute values.
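
As a small illustration of the signed edge vector, the sketch below builds it for a toy parameter set. The helper name `edge_sign_vector` and the example weights are hypothetical, and edges are assumed to be keyed as pairs `(s, t)` with `s < t`.

```python
def edge_sign_vector(p, theta):
    """Signed edge vector: sign(theta_st) on edges, 0 for every other pair."""
    signs = {}
    for s in range(p):
        for t in range(s + 1, p):
            w = theta.get((s, t), 0.0)
            signs[(s, t)] = (w > 0) - (w < 0)  # +1, -1, or 0
    return signs

# Hypothetical 3-node example: one positive and one negative edge.
theta_star = {(0, 1): 0.5, (1, 2): -0.3}
E_star = edge_sign_vector(3, theta_star)
# Taking absolute values recovers the plain (unsigned) edge set.
edge_set = {pair for pair, sgn in E_star.items() if sgn != 0}
```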

The classical notion of statistical consistency applies to the limiting behavior of an
estimation procedure as the sample size $n$ goes to infinity, with the model size $p$ itself remaining
fixed. In many contemporary applications of graphical models (e.g., gene microarrays,
social networks), the model dimension $p$ is comparable to or larger than the sample size $n$,
so that the relevance of such "fixed $p$" asymptotics is doubtful. Accordingly, the goal of
this paper is to develop the broader notion of high-dimensional consistency, in which both
the model dimension and the sample size are allowed to increase, and we study the scaling
conditions under which consistent model selection is achievable.

More precisely, we consider sequences of graphical model selection problems, indexed
by the sample size $n$, number of vertices $p$, and maximum node degree $d$. We assume that
the sample size $n$ goes to infinity, and both the problem dimension $p = p(n)$ and $d = d(n)$
may also scale as a function of $n$. The setting of fixed $p$ or $d$ is covered as a special case.
Let $\hat{E}_n$ be an estimator of the signed edge pattern $E^*$, based on the $n$ samples. Our goal is
to establish sufficient conditions on the scaling of the triple $(n, p, d)$ such that our proposed
estimator is consistent in the sense that

$$P[\hat{E}_n = E^*] \to 1 \quad \text{as } n \to +\infty.$$

We sometimes call this property sparsistency, as a shorthand for consistency of the sparsity
pattern of the parameter $\theta^*$.

2.3 Neighborhood-based logistic regression 

Note that recovering the signed edge vector $E^*$ of an undirected graph $G$ is equivalent to
recovering, for each vertex $r \in V$, its neighborhood set $\mathcal{N}(r) := \{t \in V \mid (r, t) \in E\}$,
along with the correct signs $\mathrm{sign}(\theta^*_{rt})$ for all $t \in \mathcal{N}(r)$. To capture both the neighborhood
structure and sign pattern, we define the signed neighborhood set

$$\mathcal{N}_\pm(r) := \{\mathrm{sign}(\theta^*_{rt})\, t \mid t \in \mathcal{N}(r)\}. \qquad (3)$$

The next step is to observe that this signed neighborhood set $\mathcal{N}_\pm(r)$ can be recovered from
the sign-sparsity pattern of the $(p-1)$-dimensional subvector of parameters

$$\theta^*_{\setminus r} := \{\theta^*_{ru},\; u \in V \setminus r\}$$

associated with vertex $r$. In order to estimate this vector $\theta^*_{\setminus r}$, we consider the structure of
the conditional distribution of $X_r$ given the other variables $X_{\setminus r} = \{X_t \mid t \in V \setminus \{r\}\}$. A
simple calculation shows that under the model (1), this conditional distribution takes the
form

$$P_{\theta^*}(x_r \mid x_{\setminus r}) = \frac{\exp\big(2 x_r \sum_{t \in V \setminus r} \theta^*_{rt} x_t\big)}{\exp\big(2 x_r \sum_{t \in V \setminus r} \theta^*_{rt} x_t\big) + 1}. \qquad (4)$$

Thus, the variable $X_r$ can be viewed as the response variable in a logistic regression in
which all of the other variables $X_{\setminus r}$ play the role of the covariates.
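
The "simple calculation" can be checked numerically. The sketch below compares the conditional probability obtained directly from ratios of joint Ising weights with the logistic form $\exp(2 x_r a)/(\exp(2 x_r a) + 1)$, where $a = \sum_t \theta_{rt} x_t$. The function names and the 4-node cycle with its weights are illustrative assumptions.

```python
import itertools
import math

def joint_weight(x, theta):
    """Unnormalized Ising probability exp(sum_{(s,t)} theta_st x_s x_t)."""
    return math.exp(sum(w * x[s] * x[t] for (s, t), w in theta.items()))

def conditional_bruteforce(x, r, theta):
    """P(X_r = x_r | X_rest = x_rest) as a ratio of joint weights."""
    flipped = tuple(-v if i == r else v for i, v in enumerate(x))
    w = joint_weight(x, theta)
    return w / (w + joint_weight(flipped, theta))

def conditional_logistic(x, r, theta):
    """Logistic form: exp(2 x_r a) / (exp(2 x_r a) + 1), a = sum_t theta_rt x_t."""
    a = sum(w * x[t if s == r else s] for (s, t), w in theta.items() if r in (s, t))
    z = 2 * x[r] * a
    return math.exp(z) / (math.exp(z) + 1)

# Hypothetical 4-node cycle with illustrative weights.
theta_star = {(0, 1): 0.8, (1, 2): 0.3, (2, 3): -0.5, (0, 3): 0.4}
max_gap = max(
    abs(conditional_bruteforce(x, r, theta_star) - conditional_logistic(x, r, theta_star))
    for x in itertools.product([-1, 1], repeat=4)
    for r in range(4)
)
```

The partition function cancels in the ratio, which is why the conditional distribution is tractable even though the joint distribution is not.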

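With this set-up, our method for estimating the sign-sparsity pattern of the regression
vector $\theta^*_{\setminus r}$ (and hence the neighborhood structure $\mathcal{N}_\pm(r)$) is based on computing the
$\ell_1$-regularized logistic regression of $X_r$ on the other variables $X_{\setminus r}$. Explicitly, given a set of $n$
i.i.d. samples $\{x^{(1)}, x^{(2)}, \ldots, x^{(n)}\}$, this regularized regression problem is a convex program
of the form

$$\min_{\theta_{\setminus r} \in \mathbb{R}^{p-1}} \big\{ \ell(\theta; \{x^{(i)}\}) + \lambda_n \|\theta_{\setminus r}\|_1 \big\},$$

where $\lambda_n > 0$ is a regularization parameter, to be specified by the user, and

$$\ell(\theta; \{x^{(i)}\}) := -\frac{1}{n} \sum_{i=1}^{n} \log P_\theta\big(x_r^{(i)} \mid x_{\setminus r}^{(i)}\big) \qquad (5)$$

is the rescaled negative log likelihood. (The rescaling factor $1/n$ in this definition is for later
theoretical convenience.) Following some algebraic manipulation, the regularized negative
log likelihood problem can be written as

$$\min_{\theta_{\setminus r} \in \mathbb{R}^{p-1}} \Big\{ \frac{1}{n} \sum_{i=1}^{n} f(\theta; x^{(i)}) - \sum_{u \in V \setminus r} \theta_{ru} \hat{\mu}_{ru} + \lambda_n \|\theta_{\setminus r}\|_1 \Big\}, \qquad (6)$$

where

$$f(\theta; x) := \log\Big\{ \exp\Big(\sum_{t \in V \setminus r} \theta_{rt} x_t\Big) + \exp\Big(-\sum_{t \in V \setminus r} \theta_{rt} x_t\Big) \Big\} \qquad (7)$$

is a rescaled logistic loss, and $\hat{\mu}_{ru} := \frac{1}{n} \sum_{i=1}^{n} x_r^{(i)} x_u^{(i)}$ are empirical moments. Note the
objective function is convex but not differentiable, due to the presence of the $\ell_1$-regularizer.
By Lagrangian duality, the problem (6) can be re-cast as a constrained problem over the
ball $\|\theta_{\setminus r}\|_1 \le C(\lambda_n)$. Consequently, by the Weierstrass theorem, the minimum over $\theta_{\setminus r}$ is
always achieved.

Accordingly, let $\hat{\theta}^n_{\setminus r}$ be an element of the minimizing set of problem (6). Although $\hat{\theta}^n_{\setminus r}$
need not be unique in general since the problem (6) need not be strictly convex, our analysis
shows that in the regime of interest, this minimizer $\hat{\theta}^n_{\setminus r}$ is indeed unique. We use $\hat{\theta}^n_{\setminus r}$ to
estimate the signed neighborhood $\mathcal{N}_\pm(r)$ according to

$$\hat{\mathcal{N}}_\pm(r) := \{\mathrm{sign}(\hat{\theta}_{ru})\, u \mid u \in V \setminus r,\; \hat{\theta}_{ru} \ne 0\}. \qquad (8)$$

We say that the full graph $G$ is estimated consistently, written as the event $\{\hat{G} = G(p, d)\}$,
if $\hat{\mathcal{N}}_\pm(r) = \mathcal{N}_\pm(r)$ for all $r \in V$.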
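
The neighborhood estimate can be sketched end to end in a few lines. The code below is a minimal illustration, not the solver used by the authors: it minimizes the $\ell_1$-regularized logistic objective by proximal gradient descent (a swapped-in, simple technique), and all names (`fit_node`, `signed_neighborhood`, the chain graph, `lam=0.1`) are assumptions. To keep the result deterministic, the demo weights every configuration by its exact Ising probability, a population-level stand-in for i.i.d. samples with weight $1/n$ each.

```python
import itertools
import math

def ising_probs(p, theta):
    """Exact Ising probabilities by enumeration (tiny p only)."""
    configs = list(itertools.product([-1, 1], repeat=p))
    wts = [math.exp(sum(w * x[s] * x[t] for (s, t), w in theta.items())) for x in configs]
    Z = sum(wts)
    return configs, [w / Z for w in wts]

def soft_threshold(v, tau):
    """Proximal operator of tau * |v|: shrink v toward zero by tau."""
    return math.copysign(max(abs(v) - tau, 0.0), v)

def fit_node(samples, weights, r, lam, iters=2000):
    """Proximal gradient for the l1-regularized logistic regression of x_r on the rest.

    weights should sum to one (1/n each for n i.i.d. samples).
    """
    p = len(samples[0])
    others = [u for u in range(p) if u != r]
    theta = {u: 0.0 for u in others}
    step = 1.0 / (p - 1)  # Hessian is bounded by ||x_rest||^2 = p - 1
    for _ in range(iters):
        grad = {u: 0.0 for u in others}
        for x, w in zip(samples, weights):
            a = sum(theta[u] * x[u] for u in others)
            resid = math.tanh(a) - x[r]  # derivative of log(e^a + e^-a) - x_r * a
            for u in others:
                grad[u] += w * resid * x[u]
        theta = {u: soft_threshold(theta[u] - step * grad[u], step * lam) for u in others}
    return theta

def signed_neighborhood(theta):
    """Signed neighborhood estimate: (vertex, sign) for each nonzero coefficient."""
    return {(u, 1 if v > 0 else -1) for u, v in theta.items() if v != 0.0}

# Hypothetical chain 0 - 1 - 2 - 3; exact probabilities stand in for sample weights.
theta_star = {(0, 1): 0.8, (1, 2): 0.3, (2, 3): -0.5}
configs, probs = ising_probs(4, theta_star)
theta_hat_0 = fit_node(configs, probs, r=0, lam=0.1)
nbhd_0 = signed_neighborhood(theta_hat_0)
nbhd_3 = signed_neighborhood(fit_node(configs, probs, r=3, lam=0.1))
```

Signed neighbors are returned as `(vertex, sign)` pairs rather than signed integers so that vertex 0 can carry a sign. With real data one would pass the list of samples and `weights = [1/n] * n`; recovery is then subject to the sample-size conditions analyzed in the following sections.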

3 Method and theoretical guarantees 

Our main result concerns conditions on the sample size $n$ relative to the parameters of
the graphical model (more specifically, the number of nodes $p$ and maximum node degree
$d$) that ensure that the collection of signed neighborhood estimates (8), one for each node $r$
of the graph, agree with the true neighborhoods, so that the full graph $G(p, d)$ is estimated
consistently. In this section, we begin by stating the assumptions that underlie our main
result, and then give a precise statement of the main result. We then provide a
high-level overview of the key steps involved in its proof, deferring detail to later sections. Our
analysis proceeds by first establishing sufficient conditions for correct signed neighborhood
recovery, that is, $\{\hat{\mathcal{N}}_\pm(r) = \mathcal{N}_\pm(r)\}$, for some fixed node $r \in V$. By showing that this
neighborhood consistency is achieved at exponentially fast rates, we can then use a union
bound over all $p$ nodes of the graph to conclude that consistent graph selection
$\{\hat{G} = G(p, d)\}$ is also achieved.



3.1 Assumptions 

Success of our method requires certain assumptions on the structure of the logistic
regression problem. These assumptions are stated in terms of the Hessian of the expected
negative log-likelihood $-\mathbb{E}\{\log P_\theta[X_r \mid X_{\setminus r}]\}$, as evaluated at the true model parameter
$\theta^*_{\setminus r} \in \mathbb{R}^{p-1}$. More specifically, for any fixed node $r \in V$, this Hessian is a $(p-1) \times (p-1)$
matrix of the form

$$Q^*_r := -\mathbb{E}_{\theta^*}\big\{ \nabla^2 \log P_{\theta^*}[X_r \mid X_{\setminus r}] \big\}. \qquad (9)$$

For future reference, we calculate the explicit expression

$$Q^*_r = \mathbb{E}_{\theta^*}\big[ \eta(X; \theta^*)\, X_{\setminus r} X_{\setminus r}^T \big], \qquad (10)$$

where

$$\eta(u; \theta) := \frac{4 \exp\big(2 u_r \sum_{t \in V \setminus r} \theta_{rt} u_t\big)}{\big(\exp\big(2 u_r \sum_{t \in V \setminus r} \theta_{rt} u_t\big) + 1\big)^2} \qquad (11)$$

is the variance function. Note that the matrix $Q^*_r$ is the Fisher information matrix
associated with the local conditional probability distribution. Intuitively, it serves as the
counterpart for discrete graphical models of the covariance matrix $\mathbb{E}[XX^T]$ of Gaussian
graphical models, and indeed our assumptions are analogous to those imposed in previous
work on the Lasso for Gaussian linear regression (Meinshausen and Bühlmann, 2006; Tropp,
2006; Zhao and Yu, 2007).



In the following we write simply $Q^*$ for the matrix $Q^*_r$, where the reference node $r$ should
be understood implicitly. Moreover, we use $S := \{(r, t) \mid t \in \mathcal{N}(r)\}$ to denote the subset of
indices associated with edges of $r$, and $S^c$ to denote its complement. We use $Q^*_{SS}$ to denote
the $d \times d$ sub-matrix of $Q^*$ indexed by $S$. With this notation, we state our assumptions:

[A1] Dependency condition: The subset of the Fisher information matrix corresponding
to the relevant covariates has bounded eigenvalues: there exists a constant $C_{\min} > 0$
such that

$$\Lambda_{\min}(Q^*_{SS}) \ge C_{\min}. \qquad (12)$$

Moreover, we require that $\Lambda_{\max}(\mathbb{E}_{\theta^*}[X_{\setminus r} X_{\setminus r}^T]) \le D_{\max}$. These conditions ensure that the
relevant covariates do not become overly dependent. (As stated earlier, we have suppressed
notational dependence on $r$; thus this condition is assumed to hold for all $r \in V$.)

[A2] Incoherence condition: Our next assumption captures the intuition that the large
number of irrelevant covariates (i.e., non-neighbors of node $r$) cannot exert an overly strong
effect on the subset of relevant covariates (i.e., neighbors of node $r$). To formalize this
intuition, we require the existence of an $\alpha \in (0, 1]$ such that

$$|||Q^*_{S^c S} (Q^*_{SS})^{-1}|||_\infty \le 1 - \alpha. \qquad (13)$$
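
For very small graphs, the incoherence quantity in (13) can be computed exactly by enumeration. The sketch below is illustrative only: the helper names, the chain graph, and its weights are assumptions, and the variance function is evaluated through the identity $\eta = 1/\cosh^2(\sum_t \theta_{rt} x_t)$, which is equivalent to (11) for spins in $\{-1, +1\}$. The eigenvalue condition A1 is not checked here.

```python
import itertools
import math

def ising_probs(p, theta):
    """Exact Ising probabilities {configuration: probability} by enumeration."""
    configs = list(itertools.product([-1, 1], repeat=p))
    wts = [math.exp(sum(w * x[s] * x[t] for (s, t), w in theta.items())) for x in configs]
    Z = sum(wts)
    return {x: w / Z for x, w in zip(configs, wts)}

def fisher_matrix(p, theta, r):
    """Q*_r = E[eta(X) X_rest X_rest^T]; also returns the sorted neighbors of r."""
    others = [u for u in range(p) if u != r]
    nbr = {(t if s == r else s): w for (s, t), w in theta.items() if r in (s, t)}
    Q = {(u, v): 0.0 for u in others for v in others}
    for x, prob in ising_probs(p, theta).items():
        eta = 1.0 / math.cosh(sum(w * x[t] for t, w in nbr.items())) ** 2
        for u in others:
            for v in others:
                Q[u, v] += prob * eta * x[u] * x[v]
    return Q, sorted(nbr)

def solve(A, b):
    """Solve A y = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda i: abs(M[i][c]))
        M[c], M[piv] = M[piv], M[c]
        for i in range(c + 1, n):
            f = M[i][c] / M[c][c]
            for j in range(c, n + 1):
                M[i][j] -= f * M[c][j]
    y = [0.0] * n
    for i in reversed(range(n)):
        y[i] = (M[i][n] - sum(M[i][j] * y[j] for j in range(i + 1, n))) / M[i][i]
    return y

def incoherence(p, theta, r):
    """Largest absolute row sum of Q_{S^c S} (Q_SS)^{-1} over non-neighbors of r."""
    Q, S = fisher_matrix(p, theta, r)
    Sc = [u for u in range(p) if u != r and u not in S]
    Qss = [[Q[s, t] for t in S] for s in S]
    worst = 0.0
    for u in Sc:
        row = solve(Qss, [Q[s, u] for s in S])  # Q_SS symmetric, so this is row u
        worst = max(worst, sum(abs(c) for c in row))
    return worst

# Hypothetical chain 0 - 1 - 2 - 3 with illustrative weights.
theta_star = {(0, 1): 0.8, (1, 2): 0.3, (2, 3): -0.5}
inc_0 = incoherence(4, theta_star, 0)  # node 0: neighbor {1}
inc_1 = incoherence(4, theta_star, 1)  # node 1: neighbors {0, 2}
```

On this chain the values work out to $\tanh(0.3)$ and $\tanh(0.5)$ respectively, so condition A2 holds with $\alpha = 1 - \tanh(0.5)$ for these two nodes; stronger couplings between neighbors and non-neighbors push the quantity toward one.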



3.2 Statement of main result 

We are now ready to state our main result on the performance of neighborhood logistic
regression for graphical model selection. Naturally, the limits of model selection are
determined by the minimum value over the parameters $\theta^*_{rt}$ for pairs $(r, t)$ included in the edge
set of the true graph. Accordingly, we define the parameter

$$\theta^*_{\min} := \min_{(r,t) \in E} |\theta^*_{rt}|. \qquad (14)$$

With this definition, we have the following

Theorem 1. Consider a sequence of graphs $\{G(p, d)\}$ such that conditions A1 and A2 are
satisfied by the population Fisher information matrices $Q^*$. If the sample size $n$ satisfies

$$n > L d^3 \log(p) \qquad (15)$$

for some constant $L$, and the minimum value $\theta^*_{\min}$ decays no faster than $O(1/d)$, then for the
regularization sequence $\lambda_n = 2\sqrt{\log p / n}$, the estimated graph $\hat{G}(\lambda_n)$ obtained by neighborhood
logistic regression satisfies

$$P[\hat{G}(\lambda_n) \ne G(p, d)] = O\big(\exp(-K n/d^3 - 3 \log(p))\big) \to 0 \qquad (16)$$

for some constant $K$.



Remarks: For model selection in graphical models, one is typically interested in node
degrees $d$ that remain bounded (e.g., $d = O(1)$), or that grow only weakly with graph size
(say $d = o(p)$). In such cases, the growth condition (15) allows the number of observations
to be substantially smaller than the graph size, i.e., the "large $p$, small $n$" regime. In
particular, the graph size $p$ can grow exponentially with the number of observations (i.e.,
$p(n) = \exp(n^a)$ for some $a \in (0, 1)$).

In terms of the choice of regularization, the sequence $\lambda_n$ needs to satisfy the following
conditions:

$$n \lambda_n^2 > 2 \log(p), \quad \text{and} \quad \sqrt{d}\, \lambda_n = O(\theta^*_{\min}).$$

Under the growth condition (15), the choice $\lambda_n = 2\sqrt{\log p / n}$ suffices as long as $\theta^*_{\min}$ decays no
faster than $O(1/d)$.
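
The two scaling requirements are easy to evaluate for concrete problem sizes. The sketch below assumes the regularization choice $\lambda_n = 2\sqrt{\log p / n}$ and the growth condition $n > L d^3 \log p$; the constant `L` is not specified by the theory, so the default `L=1.0` is only a placeholder, and the function names are hypothetical.

```python
import math

def regularization(n, p):
    """lambda_n = 2 * sqrt(log(p) / n), the choice assumed from Theorem 1."""
    return 2.0 * math.sqrt(math.log(p) / n)

def satisfies_growth(n, p, d, L=1.0):
    """Check n > L * d^3 * log(p); L is an unspecified constant, 1.0 is a placeholder."""
    return n > L * d ** 3 * math.log(p)

# Illustrative problem size: many more nodes than would fit a fixed-p analysis.
n, p, d = 2000, 1000, 4
lam = regularization(n, p)
ok_lambda = n * lam ** 2 > 2 * math.log(p)  # first condition on lambda_n
ok_growth = satisfies_growth(n, p, d)
```

With this choice, $n \lambda_n^2 = 4 \log p$, so the first condition holds for every $n$ and every $p > 1$; the binding requirement is the growth condition on $n$.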

The analysis required to prove Theorem 1 can be divided naturally into two parts. First,
in Section 4, we prove a result (stated as Proposition 1) for "fixed design" matrices. More
precisely, we show that if the dependence (A1) and mutual incoherence (A2) conditions hold for
the sample Fisher information matrix

$$Q^n := \hat{\mathbb{E}}\big[\eta(X; \theta^*)\, X_{\setminus r} X_{\setminus r}^T\big] = \frac{1}{n} \sum_{i=1}^{n} \eta(x^{(i)}; \theta^*)\, x^{(i)}_{\setminus r} \big(x^{(i)}_{\setminus r}\big)^T, \qquad (17)$$

then the growth condition (15) and choice of $\lambda_n$ from Theorem 1 are sufficient to ensure that
the graph is recovered with high probability. Interestingly, our analysis shows that if the
conditions are imposed directly on the sample Fisher information matrices and $\theta^*_{\min} = \Theta(1)$,
then the weaker growth condition $n = \Omega(d^2 \log(p))$ suffices for asymptotically exact graph
recovery.

The second part of the analysis, provided in Section 5, is devoted to showing that under
the specified growth conditions (15), imposing incoherence and dependence assumptions
on the population version of the Fisher information $Q^*$ guarantees (with high probability)
that analogous conditions hold for the sample quantities $Q^n$. While it follows immediately
from the law of large numbers that the empirical Fisher information $Q^n_{AA}$ converges to the
population version $Q^*_{AA}$ for any fixed subset $A$, the delicacy is that we require controlling
this convergence over subsets of increasing size. The analysis therefore requires some
large-deviations bounds, so as to provide exponential control on the rates of convergence.

3.3 Primal-dual witness for graph recovery 

At a high level, the core of our proof is the notion of a primal-dual witness. In particular,
we explicitly construct an optimal primal-dual pair, namely, a primal solution $\hat{\theta}$, along with
an associated subgradient vector $\hat{z}$ (which can be interpreted as a dual solution), such that
the Karush-Kuhn-Tucker (KKT) conditions associated with the convex program (6) are
satisfied. Moreover, we show that under the stated assumptions on $(n, p, d)$, the
primal-dual pair $(\hat{\theta}, \hat{z})$ can be constructed such that they act as a witness, that is, a certificate
guaranteeing that the method correctly recovers the graph structure.
Let us write the convex program (6) in the form

$$\min_{\theta_{\setminus r} \in \mathbb{R}^{p-1}} \big\{ \ell(\theta; \{x^{(i)}\}) + \lambda_n \|\theta_{\setminus r}\|_1 \big\}, \qquad (18)$$

where

$$\ell(\theta; \{x^{(i)}\}) = \ell(\theta) = \frac{1}{n} \sum_{i=1}^{n} f(\theta; x^{(i)}) - \sum_{u \in V \setminus r} \theta_{ru} \hat{\mu}_{ru} \qquad (19)$$

is the negative log likelihood associated with the logistic regression model. The KKT
conditions associated with this model can be expressed as follows:

$$\nabla \ell(\hat{\theta}) + \lambda_n \hat{z} = 0, \qquad (20)$$

where the dual or subgradient vector $\hat{z} \in \mathbb{R}^{p-1}$ satisfies the properties

$$\hat{z}_{rt} = \mathrm{sign}(\hat{\theta}_{rt}) \text{ if } \hat{\theta}_{rt} \ne 0, \quad \text{and} \quad |\hat{z}_{rt}| \le 1 \text{ otherwise.} \qquad (21)$$

One way to understand this vector $\hat{z}$ is as a subgradient, meaning an element of the
subdifferential of the $\ell_1$-norm (see Rockafellar, 1970). An alternative interpretation is based
on the constrained equivalent to problem (18), involving the constraint $\|\theta\|_1 \le C(\lambda_n)$.
This $\ell_1$-constraint is equivalent to the family of constraints $v^T \theta \le C$, where the vector
$v \in \{-1, +1\}^{p-1}$ ranges over all possible sign vectors. In this formulation, the optimal dual
vector is simply the conic combination

$$\lambda_n \hat{z} = \sum_{v \in \{-1, +1\}^{p-1}} \alpha^*_v\, v, \qquad (22)$$

where $\alpha^*_v \ge 0$ is the Lagrange multiplier associated with the constraint $v^T \theta \le C$.

The KKT conditions (20) and (21) must be satisfied by any optimal pair $(\hat{\theta}, \hat{z})$ to the
convex program (18). In order for this primal-dual pair to correctly specify the graph
structure, we require furthermore that the following properties are satisfied:

$$\mathrm{sign}(\hat{z}_{rt}) = \mathrm{sign}(\theta^*_{rt}) \text{ for all } (r, t) \in S := \{(r, t) \mid t \in \mathcal{N}(r)\}, \text{ and} \qquad (23a)$$

$$\hat{\theta}_{S^c} = 0, \text{ where } S^c := \{(r, u) \mid (r, u) \notin E\}. \qquad (23b)$$

We now construct our witness pair $(\hat{\theta}, \hat{z})$ as follows. First, we set $\hat{\theta}_S$ as the minimizer
of the partial penalized likelihood,

$$\hat{\theta}_S = \arg\min_{(\theta_S, 0) \in \mathbb{R}^{p-1}} \big\{ \ell(\theta; \{x^{(i)}\}) + \lambda_n \|\theta_S\|_1 \big\}, \qquad (24)$$

and set $\hat{z}_S = \mathrm{sign}(\hat{\theta}_S)$. We then set $\hat{\theta}_{S^c} = 0$ so that condition (23b) holds. Finally, we
obtain $\hat{z}_{S^c}$ from equation (20) by plugging in the values of $\hat{\theta}$ and $\hat{z}_S$. Thus, our construction
satisfies conditions (23b) and (20). The remainder of the analysis consists of showing that
our conditions on $(n, p, d)$ imply that, with high probability, the remaining conditions (23a)
and (21) are satisfied.
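
The witness construction can be traced numerically on a toy problem. The sketch below is an illustration under stated assumptions: it solves the restricted problem on the true support by proximal gradient descent (a simple stand-in solver), sets the off-support coordinates to zero, and reads the off-support dual variables from the stationarity condition $\nabla \ell(\hat{\theta}) + \lambda_n \hat{z} = 0$. The names `witness` and `gradient`, the chain graph, and `lam=0.1` are hypothetical, and exact Ising probabilities are used as sample weights so that the output is deterministic.

```python
import itertools
import math

def ising_probs(p, theta):
    """Exact Ising probabilities by enumeration (tiny p only)."""
    configs = list(itertools.product([-1, 1], repeat=p))
    wts = [math.exp(sum(w * x[s] * x[t] for (s, t), w in theta.items())) for x in configs]
    Z = sum(wts)
    return configs, [w / Z for w in wts]

def gradient(samples, weights, r, theta):
    """Gradient of the weighted negative log-likelihood with respect to theta_r."""
    grad = {u: 0.0 for u in theta}
    for x, w in zip(samples, weights):
        a = sum(theta[u] * x[u] for u in theta)
        for u in theta:
            grad[u] += w * (math.tanh(a) - x[r]) * x[u]
    return grad

def witness(samples, weights, r, S, lam, iters=500):
    """Oracle construction: fit on the true support S, keep zeros elsewhere,
    then recover the off-support dual variables from stationarity."""
    p = len(samples[0])
    theta = {u: 0.0 for u in range(p) if u != r}
    step = 1.0 / len(S)  # restricted Hessian is bounded by |S|
    for _ in range(iters):
        grad = gradient(samples, weights, r, theta)
        for u in S:  # coordinates outside S stay at zero
            v = theta[u] - step * grad[u]
            theta[u] = math.copysign(max(abs(v) - step * lam, 0.0), v)
    grad = gradient(samples, weights, r, theta)
    z = {}
    for u in theta:
        if u in S:
            z[u] = math.copysign(1.0, theta[u])  # z_S = sign(theta_S), assuming theta_S != 0
        else:
            z[u] = -grad[u] / lam  # from grad + lam * z = 0
    return theta, z

# Hypothetical chain 0 - 1 - 2 - 3; node 0 has true support S = [1].
theta_star = {(0, 1): 0.8, (1, 2): 0.3, (2, 3): -0.5}
configs, probs = ising_probs(4, theta_star)
theta_hat, z_hat = witness(configs, probs, r=0, S=[1], lam=0.1)
strictly_feasible = max(abs(z_hat[u]) for u in (2, 3)) < 1
```

Strict dual feasibility off the support, together with the correct sign on the support, is what certifies that the unrestricted problem has the same solution as the oracle-restricted one.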

This strategy is justified by the following lemma, which provides sufficient conditions 
for shared sparsity and uniqueness of the optimal solution: 

Lemma 1 (Shared sparsity and uniqueness). If there exists an optimal primal solution
$\hat{\theta}$ with associated optimal dual vector $\hat{z}$ such that $\|\hat{z}_{S^c}\|_\infty < 1$, then any optimal primal
solution $\tilde{\theta}$ must have $\tilde{\theta}_{S^c} = 0$. Moreover, if the Hessian sub-matrix $[\nabla^2 \ell(\hat{\theta})]_{SS} \succ 0$, then $\hat{\theta}$
is the unique optimal solution.

Proof. By Lagrangian duality, the penalized problem (18) can be written as an
equivalent constrained optimization problem over the ball $\|\theta\|_1 \le C(\lambda_n)$, for some constant
$C(\lambda_n) < +\infty$. Since the Lagrange multiplier associated with this constraint, namely $\lambda_n$,
is strictly positive, the constraint is active at any optimal solution, so that $\|\theta\|_1$ is constant
across all optimal solutions. Consider the representation of $\hat{z}$ as the convex combination (22)
of sign vectors $v \in \{-1, +1\}^{p-1}$, where the weights $\alpha^*_v$ are non-negative and sum to one.
Since $\alpha^*$ is an optimal vector of Lagrange multipliers for the optimal primal solution $\hat{\theta}$, it
follows (Bertsekas, 1995) that any other optimal primal solution $\tilde{\theta}$ must minimize the
associated Lagrangian (i.e., satisfy equation (20)), and moreover must satisfy the complementary
slackness conditions $\alpha^*_v [v^T \tilde{\theta} - C] = 0$ for all sign vectors $v$. But these conditions imply that
$\hat{z}^T \tilde{\theta} = C = \|\tilde{\theta}\|_1$, which cannot occur if $\tilde{\theta}_j \ne 0$ for some index $j$ for which $|\hat{z}_j| < 1$. We thus
conclude that $\tilde{\theta}_{S^c} = 0$ for all optimal primal solutions.

Finally, given that all optimal solutions satisfy $\tilde{\theta}_{S^c} = 0$, we may consider the restricted
optimization problem subject to this set of constraints. If the principal submatrix of the
Hessian is positive definite, then this sub-problem is strictly convex, so that the optimal
solution must be unique. □

In our primal-dual witness proof, we exploit this lemma by constructing a primal-dual
pair $(\hat{\theta}, \hat{z})$ such that $\|\hat{z}_{S^c}\|_\infty < 1$. Moreover, under the conditions of Theorem 1, we prove
that the sub-matrix of the sample Fisher information matrix is strictly positive definite with
high probability, so that the primal solution $\hat{\theta}$ is guaranteed to be unique.




4 Analysis under sample Fisher matrix assumptions 

We begin by establishing model selection consistency when assumptions are imposed
directly on the sample Fisher matrix $Q^n$, as opposed to on the population matrix $Q^*$, as in
Theorem 1. In particular, we define the "good event"

$$\mathcal{M}(\{x^{(i)}\}) := \big\{ \{x^{(i)}\} \mid Q^n = \nabla^2 \ell(\theta^*) \text{ satisfies A1 and A2} \big\}. \qquad (25)$$

We then state the following

Proposition 1 (Consistency for fixed design). If $n > L d^2 \log(p)$ for a suitably large
constant $L$, and the minimum value $\theta^*_{\min}$ decays no faster than $O(1/\sqrt{d})$, then for the
regularization sequence $\lambda_n = 2\sqrt{\log p / n}$, the estimated graph $\hat{G}(\lambda_n)$ obtained by neighborhood logistic
regression satisfies

$$P\big[\hat{G}(\lambda_n) \ne G(p, d) \mid \mathcal{M}(\{x^{(i)}\})\big] = O\big(\exp(-n \lambda_n^2 - \log(p))\big) \to 0. \qquad (26)$$

Loosely stated, this result guarantees that if the sample Fisher information matrix is
"good", then the conditional probability of incorrect graph recovery converges to zero at
the specified rate. The remainder of this section is devoted to the proof of Proposition 1.

4.1 Key technical results 

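We begin with statements of some key technical lemmas that are central to our main
argument, with their proofs deferred to Appendix A. The central object is the following
expansion, obtained by re-writing the zero-subgradient condition as

$$\nabla \ell(\hat{\theta}; \{x^{(i)}\}) - \nabla \ell(\theta^*; \{x^{(i)}\}) = W^n - \lambda_n \hat{z}, \qquad (27)$$

where we have introduced the short-hand notation $W^n$ for the $(p-1)$-vector

$$W^n := -\nabla \ell(\theta^*; \{x^{(i)}\}) = \frac{1}{n} \sum_{i=1}^{n} x^{(i)}_{\setminus r} \Big\{ x^{(i)}_r - \frac{\exp\big(\sum_{t \in V \setminus r} \theta^*_{rt} x^{(i)}_t\big) - \exp\big(-\sum_{t \in V \setminus r} \theta^*_{rt} x^{(i)}_t\big)}{\exp\big(\sum_{t \in V \setminus r} \theta^*_{rt} x^{(i)}_t\big) + \exp\big(-\sum_{t \in V \setminus r} \theta^*_{rt} x^{(i)}_t\big)} \Big\}.$$

For future reference, note that $\mathbb{E}_{\theta^*}[W^n] = 0$. Next, applying the mean-value theorem
coordinate-wise to the expansion (27) yields

$$\nabla^2 \ell(\theta^*; \{x^{(i)}\})\, [\hat{\theta} - \theta^*] = W^n - \lambda_n \hat{z} + R^n, \qquad (28)$$

where the remainder term takes the form

$$R^n_j = \big[ \nabla^2 \ell(\theta^*; \{x^{(i)}\}) - \nabla^2 \ell(\bar{\theta}^{(j)}; \{x^{(i)}\}) \big]_j^T (\hat{\theta} - \theta^*), \qquad (29)$$

with $\bar{\theta}^{(j)}$ a parameter vector on the line between $\theta^*$ and $\hat{\theta}$, and with $[\cdot]_j^T$ denoting the $j$th
row of the matrix. The following lemma addresses the behavior of the term $W^n$ in this
expansion:

Lemma 2. If $n \lambda_n^2 > \log(p)$, then for the specified mutual incoherence parameter $\alpha \in (0, 1]$,
we have

$$P\Big( \frac{2 - \alpha}{\lambda_n} \|W^n\|_\infty \ge \frac{\alpha}{4} \Big) = O\big(\exp(-n \lambda_n^2 + \log(p))\big) \to 0. \qquad (30)$$

See Appendix A.1 for the proof of this claim.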

The following lemma establishes that the sub-vector $\hat{\theta}_S$ is an $\ell_2$-consistent estimate of
the true sub-vector $\theta^*_S$:

Lemma 3 ($\ell_2$-consistency of primal subvector). If $\lambda_n d \le \frac{C_{\min}^2}{10 D_{\max}}$, then as $n \to +\infty$, we
have

$$\|\hat{\theta}_S - \theta^*_S\|_2 = O_p(\sqrt{d}\, \lambda_n) \to 0. \qquad (31)$$

See Appendix A.2 for the proof of this claim.

Our final technical lemma provides control on the remainder term (29):

Lemma 4. If $n \lambda_n^2 > \log(p)$ and $d \lambda_n$ is sufficiently small, then for mutual incoherence
parameter $\alpha \in (0, 1]$, we have

$$P\Big( \frac{2 - \alpha}{\lambda_n} \|R^n\|_\infty \ge \frac{\alpha}{4} \Big) = O\big(\exp(-n \lambda_n^2 + \log(p))\big) \to 0. \qquad (32)$$

See Appendix A.3 for the proof of this claim.

4.2 Proof of Proposition 1

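Using these lemmas, we can now complete the proof of Proposition 1. Recalling our
shorthand $Q^n = \nabla^2 \ell(\theta^*; \{x^{(i)}\})$ and the fact that $\hat{\theta}_{S^c} = \theta^*_{S^c} = 0$, we re-write condition (28) in block form as:

$$Q^n_{S^c S} [\hat{\theta}_S - \theta^*_S] = W^n_{S^c} - \lambda_n \hat{z}_{S^c} + R^n_{S^c}, \qquad (33a)$$

$$Q^n_{SS} [\hat{\theta}_S - \theta^*_S] = W^n_S - \lambda_n \hat{z}_S + R^n_S. \qquad (33b)$$

Since the matrix $Q^n_{SS}$ is invertible by assumption, the conditions (33) can be re-written as

$$Q^n_{S^c S} (Q^n_{SS})^{-1} \big[ W^n_S - \lambda_n \hat{z}_S + R^n_S \big] = W^n_{S^c} - \lambda_n \hat{z}_{S^c} + R^n_{S^c}. \qquad (34)$$

Rearranging yields the condition

$$[W^n_{S^c} + R^n_{S^c}] - Q^n_{S^c S} (Q^n_{SS})^{-1} [W^n_S + R^n_S] + \lambda_n Q^n_{S^c S} (Q^n_{SS})^{-1} \hat{z}_S = \lambda_n \hat{z}_{S^c}. \qquad (35)$$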



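Strict dual feasibility: We now demonstrate that for the dual sub-vector $\hat{z}_{S^c}$ defined
by equation (35), we have $\|\hat{z}_{S^c}\|_\infty < 1$. Using the triangle inequality and the mutual
incoherence bound (13), we have that

$$\|\hat{z}_{S^c}\|_\infty \le |||Q^n_{S^c S} (Q^n_{SS})^{-1}|||_\infty \Big[ \frac{\|W^n_S\|_\infty}{\lambda_n} + \frac{\|R^n_S\|_\infty}{\lambda_n} + 1 \Big] + \frac{\|R^n_{S^c}\|_\infty}{\lambda_n} + \frac{\|W^n_{S^c}\|_\infty}{\lambda_n} \qquad (36)$$

$$\le (1 - \alpha) + (2 - \alpha) \Big[ \frac{\|R^n\|_\infty}{\lambda_n} + \frac{\|W^n\|_\infty}{\lambda_n} \Big]. \qquad (37)$$

Next, applying Lemmas 2 and 4, we have

$$\|\hat{z}_{S^c}\|_\infty \le (1 - \alpha) + \frac{\alpha}{4} + \frac{\alpha}{4} = 1 - \frac{\alpha}{2}$$

with probability converging to one.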


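Correct sign recovery: We next show that our primal sub-vector $\hat{\theta}_S$ defined by equation (24) satisfies sign consistency, meaning $\mathrm{sgn}(\hat{\theta}_S) = \mathrm{sgn}(\theta^*_S)$. In order to establish this, it suffices to show that

$$\|\hat{\theta}_S - \theta^*_S\|_\infty \le \frac{\theta^*_{\min}}{2},$$

where we recall the notation $\theta^*_{\min} := \min_{(i,j) \in E} |\theta^*_{ij}|$. From Lemma 3, we have $\|\hat{\theta}_S - \theta^*_S\|_2 = O_p(\sqrt{d}\,\lambda_n)$, so that

$$\frac{2}{\theta^*_{\min}} \|\hat{\theta}_S - \theta^*_S\|_\infty \le \frac{2}{\theta^*_{\min}} \|\hat{\theta}_S - \theta^*_S\|_2 \tag{38}$$

$$= O\Big( \frac{\sqrt{d}\,\lambda_n}{\theta^*_{\min}} \Big). \tag{39}$$

Since $\theta^*_{\min}$ decays no faster than $\Theta(1/\sqrt{d})$, the right-hand side is upper bounded by $O(\lambda_n d)$, which can be made smaller than 1 by choosing $\lambda_n$ sufficiently small, as asserted in Proposition 1.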



5 Uniform convergence of sample information matrices 

In this section, we complete the proof of Theorem 1 by showing that if the dependency (A1) and incoherence (A2) assumptions are imposed on the population Fisher information matrix, then under the specified scaling of $(n, p, d)$, analogous bounds hold for the sample Fisher information matrices with probability converging to one. These results are not immediate consequences of classical random matrix theory (e.g., Davidson and Szarek, 2001), since the elements of $Q^n$ are highly dependent.

Recall the definitions

$$Q^* := \mathbb{E}_{\theta^*}\big[\eta(X; \theta^*)\, X_{\setminus r} X_{\setminus r}^T\big], \quad \text{and} \quad Q^n := \widehat{\mathbb{E}}\big[\eta(X; \theta^*)\, X_{\setminus r} X_{\setminus r}^T\big], \tag{40}$$

where $\mathbb{E}_{\theta^*}$ denotes the population expectation, $\widehat{\mathbb{E}}$ denotes the empirical expectation, and the variance function $\eta$ was defined previously in equation (11). The following lemma asserts that the eigenvalue bounds in Assumption A1 hold with high probability for sample covariance matrices:

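Lemma 5. Suppose that assumption A1 holds for the population matrix $Q^*$ and $\mathbb{E}_{\theta^*}[XX^T]$. For any $\delta > 0$ and some fixed constants $A$ and $B$, we have

$$\mathbb{P}\Big[ \Lambda_{\max}\Big( \frac{1}{n} \sum_{i=1}^n x^{(i)}_{\setminus r} (x^{(i)}_{\setminus r})^T \Big) \ge D_{\max} + \delta \Big] \le 2 \exp\Big( -A \frac{\delta^2 n}{d^2} + B \log(d) \Big), \tag{41a}$$

$$\mathbb{P}\big[ \Lambda_{\min}(Q^n_{SS}) \le C_{\min} - \delta \big] \le 2 \exp\Big( -A \frac{\delta^2 n}{d^2} + B \log(d) \Big). \tag{41b}$$

The following result is the analog for the incoherence assumption (A2), showing that the scaling of $(n, p, d)$ given in Theorem 1 guarantees that population incoherence implies sample incoherence.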



Lemma 6. If the population covariance satisfies a mutual incoherence condition (13) with parameter $\alpha \in (0, 1]$ as in Assumption A2, then the sample matrix satisfies an analogous version, with high probability, in the sense that

$$\mathbb{P}\Big[ |||Q^n_{S^c S}(Q^n_{SS})^{-1}|||_\infty \ge 1 - \frac{\alpha}{2} \Big] \le \exp\Big( -K \frac{n}{d^3} + \log(p) \Big). \tag{42}$$



Proofs of these two lemmas are provided in the following sections. Before proceeding, we begin by taking note of a simple bound to be used repeatedly throughout our arguments. By definition of the matrices $Q^n(\theta)$ and $Q(\theta)$ (see equation (10) and the definitions in (40)), the $(j, k)$th element of the difference matrix $Q^n(\theta) - Q(\theta)$ can be written as an i.i.d. sum of the form $Z_{jk} = \frac{1}{n} \sum_{i=1}^n Z^{(i)}_{jk}$, where each $Z^{(i)}_{jk}$ is zero-mean and bounded (in particular, $|Z^{(i)}_{jk}| \le 4$). By the Azuma-Hoeffding bound (Hoeffding, 1963), for any indices $j, k = 1, \ldots, d$ and for any $\epsilon > 0$, we have

$$\mathbb{P}\big[ (Z_{jk})^2 \ge \epsilon^2 \big] = \mathbb{P}\Big[ \Big| \frac{1}{n} \sum_{i=1}^n Z^{(i)}_{jk} \Big| \ge \epsilon \Big] \le 2 \exp\Big( -\frac{\epsilon^2 n}{32} \Big). \tag{43}$$



So as to simplify notation, throughout this section, we use K to denote a universal positive 
constant, independent of {n,p, d). Note that the precise value and meaning of K may differ 
from line to line. 

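5.1 Proof of Lemma 5

By the Courant-Fischer variational representation (Horn and Johnson, 1985), we have

$$\begin{aligned}
\Lambda_{\min}(Q^*_{SS}) &= \min_{\|x\|_2 = 1} x^T Q^*_{SS}\, x \\
&= \min_{\|x\|_2 = 1} \big[ x^T Q^n_{SS}\, x + x^T (Q^*_{SS} - Q^n_{SS})\, x \big] \\
&\le y^T Q^n_{SS}\, y + y^T (Q^*_{SS} - Q^n_{SS})\, y,
\end{aligned}$$

where $y \in \mathbb{R}^d$ is a unit-norm minimal eigenvector of $Q^n_{SS}$. Therefore, we have

$$\Lambda_{\min}(Q^n_{SS}) \ge \Lambda_{\min}(Q^*_{SS}) - |||Q^*_{SS} - Q^n_{SS}|||_2 \ge C_{\min} - |||Q^*_{SS} - Q^n_{SS}|||_2.$$

Hence it suffices to obtain a bound on the spectral norm $|||Q^n_{SS} - Q^*_{SS}|||_2$. Observe that

$$|||Q^n_{SS} - Q^*_{SS}|||_2 \le \Big( \sum_{j=1}^d \sum_{k=1}^d (Z_{jk})^2 \Big)^{1/2}.$$

Setting $\epsilon^2 = \delta^2 / d^2$ in equation (43) and applying the union bound over the $d^2$ index pairs $(j, k)$ then yields

$$\mathbb{P}\big[ |||Q^n_{SS} - Q^*_{SS}|||_2 \ge \delta \big] \le 2 \exp\Big( -K \frac{\delta^2 n}{d^2} + 2 \log(d) \Big). \tag{44}$$

Similarly, we have

$$\mathbb{P}\Big[ \Lambda_{\max}\Big( \frac{1}{n} \sum_{i=1}^n x^{(i)}_{\setminus r} (x^{(i)}_{\setminus r})^T \Big) \ge D_{\max} + \delta \Big] \le \mathbb{P}\Big[ \Big|\Big|\Big| \frac{1}{n} \sum_{i=1}^n x^{(i)}_{\setminus r} (x^{(i)}_{\setminus r})^T - \mathbb{E}_{\theta^*}[X_{\setminus r} X_{\setminus r}^T] \Big|\Big|\Big|_2 \ge \delta \Big],$$

which obeys the same upper bound (44), by following the analogous argument.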
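The two linear-algebra facts used in this proof can be checked numerically. The sketch below is only an illustration: `Q_star` and `Q_n` are synthetic symmetric matrices chosen for the example (they are not Fisher information matrices of any Ising model), and the code assumes NumPy is available. It verifies that the spectral norm of a perturbation is at most its Frobenius norm, and that the minimum eigenvalue drops by at most the spectral norm of the perturbation.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6

# Synthetic positive-definite "population" matrix (hypothetical stand-in).
A = rng.standard_normal((d, d))
Q_star = A @ A.T / d + np.eye(d)

# Small symmetric perturbation playing the role of Q^n_SS - Q*_SS.
E = rng.standard_normal((d, d))
E = 0.05 * (E + E.T)
Q_n = Q_star + E

spec = np.linalg.norm(E, 2)        # spectral norm (largest singular value)
frob = np.linalg.norm(E, "fro")    # Frobenius norm: sqrt of the sum of squared entries

lam_min_star = np.linalg.eigvalsh(Q_star)[0]   # eigvalsh returns ascending eigenvalues
lam_min_n = np.linalg.eigvalsh(Q_n)[0]

# Spectral norm is dominated by the Frobenius norm.
print(spec <= frob)
# Courant-Fischer consequence: Lambda_min(Q_n) >= Lambda_min(Q_star) - |||E|||_2.
print(lam_min_n >= lam_min_star - spec)
```

Both inequalities hold for every symmetric perturbation, so the check does not depend on the particular seed.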

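5.2 Proof of Lemma 6

We begin by decomposing the sample matrix as the sum $Q^n_{S^c S}(Q^n_{SS})^{-1} = T_1 + T_2 + T_3 + T_4$, where we define

$$T_1 := Q^*_{S^c S} \big[ (Q^n_{SS})^{-1} - (Q^*_{SS})^{-1} \big], \tag{45a}$$

$$T_2 := \big[ Q^n_{S^c S} - Q^*_{S^c S} \big] (Q^*_{SS})^{-1}, \tag{45b}$$

$$T_3 := \big[ Q^n_{S^c S} - Q^*_{S^c S} \big] \big[ (Q^n_{SS})^{-1} - (Q^*_{SS})^{-1} \big], \tag{45c}$$

$$T_4 := Q^*_{S^c S} (Q^*_{SS})^{-1}. \tag{45d}$$

The fourth term is easily controlled; indeed, we have

$$|||T_4|||_\infty = |||Q^*_{S^c S}(Q^*_{SS})^{-1}|||_\infty \le 1 - \alpha$$

by the incoherence assumption A2. If we can show that $|||T_i|||_\infty \le \frac{\alpha}{6}$ for the remaining indices $i = 1, 2, 3$, then by our four term decomposition and the triangle inequality, the sample version satisfies the bound (42), as claimed. We deal with these remaining terms using the following lemmas: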

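Lemma 7. For any $\delta > 0$ and constants $K, K'$, the following bounds hold:

$$\mathbb{P}\big[ |||Q^n_{S^c S} - Q^*_{S^c S}|||_\infty \ge \delta \big] \le 2 \exp\Big( -K \frac{n \delta^2}{d^2} + \log(d) + \log(p - d) \Big), \tag{46a}$$

$$\mathbb{P}\big[ |||Q^n_{SS} - Q^*_{SS}|||_\infty \ge \delta \big] \le 2 \exp\Big( -K \frac{n \delta^2}{d^2} + 2 \log(d) \Big), \tag{46b}$$

$$\mathbb{P}\big[ |||(Q^n_{SS})^{-1} - (Q^*_{SS})^{-1}|||_\infty \ge \delta \big] \le 4 \exp\Big( -K \frac{n \delta^2}{d^3} + K' \log(d) \Big). \tag{46c}$$

See Appendix B for the proof of these claims.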


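Control of first term: Turning to the first term, we first re-factorize it as

$$T_1 = Q^*_{S^c S}(Q^*_{SS})^{-1} \big[ Q^*_{SS} - Q^n_{SS} \big] (Q^n_{SS})^{-1}$$

and then bound it (using the sub-multiplicative property $|||AB|||_\infty \le |||A|||_\infty |||B|||_\infty$) as follows:

$$\begin{aligned}
|||T_1|||_\infty &\le |||Q^*_{S^c S}(Q^*_{SS})^{-1}|||_\infty \, |||Q^n_{SS} - Q^*_{SS}|||_\infty \, |||(Q^n_{SS})^{-1}|||_\infty \\
&\le (1 - \alpha) \, |||Q^n_{SS} - Q^*_{SS}|||_\infty \, \big\{ \sqrt{d}\, |||(Q^n_{SS})^{-1}|||_2 \big\},
\end{aligned}$$

where we have used the incoherence assumption A2. Using the bound (41b) from Lemma 5 with $\delta = C_{\min}/2$, we have $|||(Q^n_{SS})^{-1}|||_2 = [\Lambda_{\min}(Q^n_{SS})]^{-1} \le \frac{2}{C_{\min}}$ with probability greater than $1 - \exp(-Kn/d^2 + 2\log(d))$. Next, applying the bound (46b) with $\delta = c/\sqrt{d}$, we conclude that with probability greater than $1 - 2\exp(-Knc^2/d^3 + \log(d))$, we have

$$|||Q^n_{SS} - Q^*_{SS}|||_\infty \le c/\sqrt{d}.$$

By choosing the constant $c > 0$ sufficiently small, we are guaranteed that

$$\mathbb{P}\big[ |||T_1|||_\infty \ge \alpha/6 \big] \le 2 \exp\Big( -K \frac{n c^2}{d^3} + \log(d) \Big). \tag{47}$$

Control of second term: To bound $T_2$, we first write

$$|||T_2|||_\infty \le \sqrt{d}\, |||(Q^*_{SS})^{-1}|||_2 \, |||Q^n_{S^c S} - Q^*_{S^c S}|||_\infty \le \frac{\sqrt{d}}{C_{\min}} \, |||Q^n_{S^c S} - Q^*_{S^c S}|||_\infty.$$

We then apply bound (46a) with $\delta = \frac{\alpha}{6} \frac{C_{\min}}{\sqrt{d}}$ to conclude that

$$\mathbb{P}\big[ |||T_2|||_\infty \ge \alpha/6 \big] \le 2 \exp\Big( -K \frac{n}{d^3} + \log(p - d) \Big). \tag{48}$$

Control of third term: Finally, in order to bound the third term $T_3$, we apply the bounds (46a) and (46c), both with $\delta = \sqrt{\alpha/6}$, and use the fact that $\log(d) \le \log(p - d)$ to conclude that

$$\mathbb{P}\big[ |||T_3|||_\infty \ge \alpha/6 \big] \le 4 \exp\Big( -K \frac{n}{d^3} + \log(p - d) \Big). \tag{49}$$

Putting together all of the pieces, we conclude that

$$\mathbb{P}\big[ |||Q^n_{S^c S}(Q^n_{SS})^{-1}|||_\infty \ge 1 - \alpha/2 \big] = O\Big( \exp\Big( -K \frac{n}{d^3} + \log(p) \Big) \Big),$$

as claimed.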
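The four-term decomposition and the re-factorization of the first term are purely algebraic identities, so they can be sanity-checked on arbitrary matrices. The following sketch assumes NumPy and uses randomly generated blocks (`Qss_star`, `Qcs_star` and their perturbed versions are hypothetical stand-ins for the population and sample blocks, not quantities computed from data).

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 4, 7          # |S| = d, |S^c| = m (illustrative sizes)

def random_spd(k):
    """Random symmetric positive-definite k x k matrix."""
    B = rng.standard_normal((k, k))
    return B @ B.T + k * np.eye(k)

# Hypothetical "population" blocks and slightly perturbed "sample" blocks.
Qss_star = random_spd(d)
P = rng.standard_normal((d, d))
Qss_n = Qss_star + 0.05 * (P + P.T)
Qcs_star = rng.standard_normal((m, d))
Qcs_n = Qcs_star + 0.01 * rng.standard_normal((m, d))

inv_star = np.linalg.inv(Qss_star)
inv_n = np.linalg.inv(Qss_n)

T1 = Qcs_star @ (inv_n - inv_star)
T2 = (Qcs_n - Qcs_star) @ inv_star
T3 = (Qcs_n - Qcs_star) @ (inv_n - inv_star)
T4 = Qcs_star @ inv_star

lhs = Qcs_n @ inv_n
decomposition_ok = np.allclose(lhs, T1 + T2 + T3 + T4)

# Re-factorization of T1 via A^{-1} - B^{-1} = B^{-1} (B - A) A^{-1}.
T1_refactored = Qcs_star @ inv_star @ (Qss_star - Qss_n) @ inv_n
refactor_ok = np.allclose(T1, T1_refactored)

def norm_inf(M):
    """Matrix infinity norm: maximum absolute row sum."""
    return np.linalg.norm(M, np.inf)

triangle_ok = norm_inf(lhs) <= sum(norm_inf(T) for T in (T1, T2, T3, T4)) + 1e-12
print(decomposition_ok, refactor_ok, triangle_ok)
```

The probabilistic content of the lemma lies entirely in bounding the three perturbation terms; the identities above hold deterministically.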



6 Experimental results 

We now describe experimental results that illustrate some consequences of Theorem 1 for various types of graphs and scalings of $(n, p, d)$. In all cases, we solved the $\ell_1$-regularized logistic regression using special purpose interior-point code developed and described by Koh et al. (2007).

We performed experiments for three different classes of graphs: (a) four-nearest neighbor lattices, (b) eight-nearest neighbor lattices, and (c) star-shaped graphs, as illustrated in Figure 1. Given a distribution $\mathbb{P}_{\theta^*}$ of the Ising form (1), we generated random data sets $\{x^{(1)}, \ldots, x^{(n)}\}$ by Gibbs sampling for the lattice models, and by exact sampling for the star graph.

Figure 1. Illustrations of different graph classes used in simulations. (a) Four-nearest neighbor grid ($d = 4$). (b) Eight-nearest neighbor grid ($d = 8$). (c) Star-shaped graph ($d = \Theta(p)$, or $d = \Theta(\log(p))$).

For a given graph class and edge strength $\omega > 0$, we examined the performance of models with mixed couplings, meaning $\theta^*_{st} = \pm\omega$ with equal probability, or with positive couplings, meaning that $\theta^*_{st} = \omega$ for all edges $(s, t)$. In all cases, we set the regularization parameter as $\lambda_n = \Theta\big(\sqrt{\log(p)/n}\big)$. Above the threshold sample size $n$ predicted by Theorem 1, this choice ensured correct model selection with high probability, consistent with the theoretical prediction. For any given graph and coupling type, we performed simulations for sample sizes $n$ scaling as $n = 10 \beta d \log(p)$, where the control parameter $\beta$ ranged from 0.1 to upwards of 2, depending on the graph type.
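The experimental scalings can be written as a few helper functions. This is a minimal sketch under stated assumptions: the function names are hypothetical, the constant `c` in the regularization level is not specified in the text and is set to 1 purely for illustration, and the star-graph degrees are rounded up (the text gives $d$ proportional to $0.1p$ or $\log(p)$, and the rounding convention is assumed here).

```python
import math

def sample_size(beta: float, d: int, p: int) -> int:
    """Sample size n = 10 * beta * d * log(p), rounded up to an integer."""
    return math.ceil(10 * beta * d * math.log(p))

def regularization(n: int, p: int, c: float = 1.0) -> float:
    """Regularization level c * sqrt(log(p) / n); the constant c is an assumption."""
    return c * math.sqrt(math.log(p) / n)

def star_degree(p: int, regime: str) -> int:
    """Hub degree for the star graphs: linear (0.1 * p) or logarithmic (log p)."""
    if regime == "linear":
        return math.ceil(p / 10)
    if regime == "log":
        return math.ceil(math.log(p))
    raise ValueError("regime must be 'linear' or 'log'")

# Grid of sample sizes for the 4-nearest-neighbor lattice with p = 64 nodes.
betas = [0.1, 0.5, 1.0, 1.5, 2.0]
grid = [sample_size(b, d=4, p=64) for b in betas]
print(grid)
print([star_degree(p, "linear") for p in (64, 100, 225)])
print([star_degree(p, "log") for p in (64, 100, 225)])
```

Plotting success probability against $\beta$ rather than against $n$ is what allows curves for different $p$ to be compared on a common axis.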

Figure 2 shows results for the 4-nearest-neighbor grid model, illustrated in Figure 1(a), for three different graph sizes $p \in \{64, 100, 225\}$, with mixed couplings (panel (a)) and attractive couplings (panel (b)). Each curve corresponds to a given problem size, and shows the success probability versus the control parameter $\beta$. Each point corresponds to the average of $N = 200$ trials. Notice how despite the very different regimes of $(n, p)$ that underlie each curve, the different curves all line up with one another quite well. This fact shows that for a fixed degree graph (in this case $d = 4$), the ratio $n/\log(p)$ controls the success/failure of our model selection procedure, consistent with the prediction of Theorem 1. Figure 3 shows analogous results for the 8-nearest-neighbor lattice model ($d = 8$), for the same range of problem size $p \in \{64, 100, 225\}$, as well as both mixed and attractive couplings. Notice how once again the curves for different problem sizes are all well-aligned, consistent with the prediction of Theorem 1.

For our last set of experiments, we investigated the performance of our method for a class of graphs with unbounded maximum degree $d$. In particular, we constructed star-shaped graphs with $p$ vertices by designating one node as the spoke, and connecting it to $d < (p - 1)$ of its neighbors. For linear sparsity, we chose $d = \lceil 0.1p \rceil$, whereas for logarithmic sparsity we chose $d = \lceil \log(p) \rceil$. We again studied a triple of graph sizes $p \in \{64, 100, 225\}$ and Figure 4 shows the resulting curves of success probability versus control parameter $\beta = n/[10 d \log(p)]$. Panels (a) and (b) correspond respectively to the cases of logarithmic and linear degrees. As with the bounded degree models in Figures 2 and 3, these curves align with one another, showing a transition from failure to success with probability one.

[Figure 2 panels: (a) 4-nearest neighbor grid (mixed); (b) 4-nearest neighbor grid (attractive). Horizontal axis: control parameter.]

Figure 2. Plots of success probability $\mathbb{P}[\hat{\mathcal{N}}_\pm(r) = \mathcal{N}_\pm(r), \forall r]$ versus the control parameter $\beta(n, p, d) = n/[10 d \log(p)]$ for Ising models on 2-D grids with four nearest-neighbor interactions ($d = 4$). (a) Randomly chosen mixed sign couplings $\theta^*_{st} = \pm 0.50$. (b) All positive couplings $\theta^*_{st} = 0.50$.



[Figure 3 panels: (a) 8-nearest neighbor grid (mixed); (b) 8-nearest neighbor grid (attractive). Horizontal axis: control parameter. Curves: p = 64, p = 100, p = 225.]

Figure 3. Plots of success probability $\mathbb{P}[\hat{\mathcal{N}}_\pm(r) = \mathcal{N}_\pm(r), \forall r]$ versus the control parameter $\beta(n, p, d) = n/[10 d \log(p)]$ for Ising models on 2-D grids with eight nearest-neighbor interactions ($d = 8$). (a) Randomly chosen mixed sign couplings $\theta^*_{st} = \pm 0.25$. (b) All positive couplings $\theta^*_{st} = 0.25$.

[Figure 4 panels: (a) Star graph; Logarithmic neighbors; (b) Star graph; Linear fraction neighbors. Curves: p = 64, p = 100, p = 225.]

Figure 4. Plots of success probability $\mathbb{P}[\hat{\mathcal{N}}_\pm(r) = \mathcal{N}_\pm(r), \forall r]$ versus the control parameter $\beta(n, p, d) = n/[10 d \log(p)]$ for star-shaped graphs, in which $d = \Theta(p)$, for attractive couplings. (a) Logarithmic growth in degrees. (b) Linear growth in degrees.
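A single trial of the star-graph experiment can be sketched in a few lines. The sketch below is not the interior-point code of Koh et al. used for the reported results: it substitutes a plain proximal-gradient (ISTA) solver, assumes NumPy, recovers only the neighborhood of the hub node (node 0), and uses illustrative values for the sample size, coupling strength, and regularization level `lam` (a deliberately conservative, hypothetical choice). All function names are hypothetical.

```python
import numpy as np

def sample_star(n, p, d, omega, rng):
    """Exact sampling from a star-shaped Ising model with hub node 0.

    Nodes 1..d are coupled to the hub with strength omega; the remaining
    nodes are isolated. All spins take values in {-1, +1}.
    """
    x = rng.choice([-1, 1], size=(n, p))               # hub and isolated nodes are uniform
    p_agree = np.exp(omega) / (2.0 * np.cosh(omega))   # P[leaf equals hub | hub]
    agree = rng.random((n, d)) < p_agree
    x[:, 1:d + 1] = np.where(agree, x[:, [0]], -x[:, [0]])
    return x

def l1_logistic(X, y, lam, n_iter=2000):
    """ISTA for (1/n) * sum_i log(1 + exp(-y_i * x_i^T w)) + lam * ||w||_1."""
    n, k = X.shape
    step = 4.0 * n / np.linalg.norm(X, 2) ** 2         # 1/L with L = ||X||_2^2 / (4n)
    w = np.zeros(k)
    for _ in range(n_iter):
        margin = y * (X @ w)
        # sigma(-margin) written via tanh for numerical stability
        grad = -(X.T @ (y * 0.5 * (1.0 - np.tanh(0.5 * margin)))) / n
        w = w - step * grad
        w = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)   # soft threshold
    return w

rng = np.random.default_rng(0)
n, p, d, omega = 5000, 16, 3, 1.0
x = sample_star(n, p, d, omega, rng)

# Regress the hub on all other nodes; for the Ising model the logistic
# weights equal twice the couplings, so theta_hat = w_hat / 2.
X, y = x[:, 1:].astype(float), x[:, 0].astype(float)
w_hat = l1_logistic(X, y, lam=0.1)
theta_hat = w_hat / 2.0

neighbors_hat = {int(j) + 1 for j in np.flatnonzero(theta_hat)}
print(sorted(neighbors_hat))
```

Repeating such a trial over many independent data sets, and over every node rather than the hub alone, would give an empirical success probability of the kind plotted in Figure 4.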



7 Conclusion 



We have shown that a technique based on $\ell_1$-regularized logistic regression can be used to perform consistent model selection in discrete graphical models, with polynomial computational complexity and sample complexity logarithmic in the graph size. Our analysis applies to the high-dimensional setting, in which both the number of nodes $p$ and maximum neighborhood sizes $d$ are allowed to grow as a function of the number of observations $n$. There are a number of possible directions for future work. For bounded degree graphs, our results show that the structure can be recovered with high probability once $n/\log(p)$ is sufficiently large. Up to constant factors, this result matches known information-theoretic lower bounds (Bresler et al., 2008; Santhanam and Wainwright, 2008). On the other hand, our experimental results on graphs with growing degrees (star-shaped graphs) are consistent with the conjecture that the logistic regression procedure exhibits a threshold at a sample size $n = \Theta(d \log p)$, at least for problems where the minimum value $\theta^*_{\min}$ stays bounded away from zero. It would be interesting to provide a sharp threshold result for this problem, to parallel the known thresholds for $\ell_1$-regularized linear regression, or the Lasso (see Wainwright (2006)). Finally, the ideas described here, while specialized in this paper to the pairwise binary case, are more broadly applicable to discrete graphical models with a higher number of states; this is an interesting direction for future research.

A Proofs for Section 4.1

In this section, we provide proofs of Lemma 2, Lemma 3 and Lemma 4, previously stated in Section 4.1.



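A.1 Proof of Lemma 2

Note that any entry of $W^n$ has the form $W^n_u = \frac{1}{n} \sum_{i=1}^n Z^{(i)}_u$, where for $i = 1, 2, \ldots, n$, the variables

$$Z^{(i)}_u := x^{(i)}_u \Big\{ x^{(i)}_r - \mathbb{P}_{\theta^*}[x_r = 1 \mid x^{(i)}_{\setminus r}] + \mathbb{P}_{\theta^*}[x_r = -1 \mid x^{(i)}_{\setminus r}] \Big\}$$

are zero-mean under $\mathbb{P}_{\theta^*}$, i.i.d., and bounded ($|Z^{(i)}_u| \le 2$). Therefore, by the Azuma-Hoeffding inequality (Hoeffding, 1963), we have, for any $\delta > 0$, $\mathbb{P}[|W^n_u| > \delta] \le 2 \exp\big( -\frac{n \delta^2}{8} \big)$. Setting $\delta = \frac{\alpha \lambda_n}{4(2 - \alpha)}$, we obtain

$$\mathbb{P}\Big[ \frac{2 - \alpha}{\lambda_n} |W^n_u| > \frac{\alpha}{4} \Big] \le 2 \exp(-K n \lambda_n^2)$$

for some constant $K$. Finally, applying a union bound over the indices $u$ of $W^n$ yields

$$\mathbb{P}\Big[ \frac{2 - \alpha}{\lambda_n} \|W^n\|_\infty > \frac{\alpha}{4} \Big] \le 2 \exp(-K n \lambda_n^2 + \log(p)),$$

as claimed.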
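To get a feel for the magnitude of this bound, the union-bounded Hoeffding tail can be evaluated directly. The sketch below is illustrative only: the function name is hypothetical, and it makes the constant explicit as $K = \alpha^2 / (128 (2 - \alpha)^2)$, which is what the choice of $\delta$ above gives when substituted into $2\exp(-n\delta^2/8)$.

```python
import math

def lemma2_tail_bound(n: int, p: int, lam: float, alpha: float) -> float:
    """Union bound p * 2 * exp(-n * delta^2 / 8), delta = alpha * lam / (4 * (2 - alpha))."""
    delta = alpha * lam / (4.0 * (2.0 - alpha))
    return p * 2.0 * math.exp(-n * delta ** 2 / 8.0)

def explicit_constant(alpha: float) -> float:
    """Constant K such that n * delta^2 / 8 = K * n * lam^2."""
    return alpha ** 2 / (128.0 * (2.0 - alpha) ** 2)

# Illustrative values: the bound is vacuous for small n and decays quickly afterwards.
for n in (10_000, 50_000, 100_000):
    print(n, lemma2_tail_bound(n, p=100, lam=0.5, alpha=0.5))
```

Because $K$ is small when $\alpha$ is small, the bound becomes informative only once $n \lambda_n^2$ is a sufficiently large multiple of $\log(p)$.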



A.2 Proof of Lemma 3

Following a method of Rothman et al. (2008), we define the function $G : \mathbb{R}^d \to \mathbb{R}$ by

$$G(u_S) := \ell(\theta^*_S + u_S; \{x^{(i)}\}) - \ell(\theta^*_S; \{x^{(i)}\}) + \lambda_n \big( \|\theta^*_S + u_S\|_1 - \|\theta^*_S\|_1 \big). \tag{50}$$

It can be seen from equation (24) that $\hat{u} = \hat{\theta}_S - \theta^*_S$ minimizes $G$. Moreover $G(0) = 0$ by construction; therefore, we must have $G(\hat{u}) \le 0$. Note also that $G$ is convex. Suppose that we show that for some radius $B > 0$, and for $u \in \mathbb{R}^d$ with $\|u\|_2 = B$, we have $G(u) > 0$. We then claim that $\|\hat{u}\|_2 \le B$. Indeed, if $\hat{u}$ lay outside the ball of radius $B$, then the convex combination $t\hat{u} + (1 - t)(0)$ would lie on the boundary of the ball, for an appropriately chosen $t \in (0, 1)$. By convexity,

$$G(t\hat{u} + (1 - t)(0)) \le t\,G(\hat{u}) + (1 - t)\,G(0) \le 0,$$

contradicting the assumed strict positivity of $G$ on the boundary.

It thus suffices to establish strict positivity of $G$ on the boundary of the ball with radius $B = M \lambda_n \sqrt{d}$, where $M > 0$ is a parameter to be chosen later in the proof. Let $u \in \mathbb{R}^d$ be an arbitrary vector with $\|u\|_2 = B$. Recalling the notation $W = \nabla \ell(\theta^*; \{x^{(i)}\})$, by a Taylor series expansion of the log likelihood component of $G$, we have

$$G(u) = W_S^T u + u^T \big[ \nabla^2 \ell(\theta^*_S + \alpha u) \big] u + \lambda_n \big( \|\theta^*_S + u_S\|_1 - \|\theta^*_S\|_1 \big), \tag{51}$$

for some $\alpha \in [0, 1]$. For the first term, we have the bound

$$|W_S^T u| \le \|W_S\|_\infty \|u\|_1 \le \|W_S\|_\infty \sqrt{d}\, \|u\|_2 \le (\lambda_n \sqrt{d})^2 \frac{M}{4}, \tag{52}$$

since $\|W_S\|_\infty \le \frac{\lambda_n}{4}$ with probability converging to one from Lemma 2.

Applying the triangle inequality to the last term in the expansion (51) yields

$$\lambda_n \big( \|\theta^*_S + u_S\|_1 - \|\theta^*_S\|_1 \big) \ge -\lambda_n \|u_S\|_1 \ge -\lambda_n \sqrt{d}\, \|u_S\|_2 = -M (\sqrt{d}\, \lambda_n)^2. \tag{53}$$

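Finally, turning to the middle Hessian term, we have

$$\begin{aligned}
q^* &:= \Lambda_{\min}\big( \nabla^2 \ell(\theta^*_S + \alpha u; \{x^{(i)}\}) \big) \\
&\ge \min_{\alpha \in [0,1]} \Lambda_{\min}\big( \nabla^2 \ell(\theta^*_S + \alpha u_S; \{x^{(i)}\}) \big) \\
&= \min_{\alpha \in [0,1]} \Lambda_{\min}\Big[ \frac{1}{n} \sum_{i=1}^n \eta(x^{(i)}; \theta^*_S + \alpha u_S)\, x^{(i)}_S (x^{(i)}_S)^T \Big].
\end{aligned}$$

By a Taylor series expansion of $\eta(x^{(i)}; \cdot)$, we have

$$\begin{aligned}
q^* &\ge \Lambda_{\min}\Big[ \frac{1}{n} \sum_{i=1}^n \eta(x^{(i)}; \theta^*_S)\, x^{(i)}_S (x^{(i)}_S)^T \Big] - \max_{\alpha \in [0,1]} \Big|\Big|\Big| \frac{1}{n} \sum_{i=1}^n \eta'(x^{(i)}; \theta^*_S + \alpha u_S)\, (u_S^T x^{(i)}_S)\, x^{(i)}_S (x^{(i)}_S)^T \Big|\Big|\Big|_2 \\
&= \Lambda_{\min}(Q^n_{SS}) - \max_{\alpha \in [0,1]} \Big|\Big|\Big| \frac{1}{n} \sum_{i=1}^n \eta'(x^{(i)}; \theta^*_S + \alpha u_S)\, (u_S^T x^{(i)}_S)\, x^{(i)}_S (x^{(i)}_S)^T \Big|\Big|\Big|_2 \\
&\ge C_{\min} - \max_{\alpha \in [0,1]} \Big|\Big|\Big| \frac{1}{n} \sum_{i=1}^n \eta'(x^{(i)}; \theta^*_S + \alpha u_S)\, (u_S^T x^{(i)}_S)\, x^{(i)}_S (x^{(i)}_S)^T \Big|\Big|\Big|_2.
\end{aligned}$$

It remains to control the final spectral norm. For any fixed $\alpha \in [0, 1]$ and $y \in \mathbb{R}^d$ with $\|y\|_2 = 1$, we have

$$\begin{aligned}
y^T \Big\{ \frac{1}{n} \sum_{i=1}^n \eta'(x^{(i)}; \theta^*_S + \alpha u_S)\, (u_S^T x^{(i)}_S)\, x^{(i)}_S (x^{(i)}_S)^T \Big\} y
&= \frac{1}{n} \sum_{i=1}^n \eta'(x^{(i)}; \theta^*_S + \alpha u_S)\, (u_S^T x^{(i)}_S)\, \big[ (x^{(i)}_S)^T y \big]^2 \\
&\le \frac{1}{n} \sum_{i=1}^n \big| \eta'(x^{(i)}; \theta^*_S + \alpha u_S)\, (u_S^T x^{(i)}_S) \big|\, \big[ (x^{(i)}_S)^T y \big]^2.
\end{aligned}$$

Now note that $|\eta'(x^{(i)}; \theta^*_S + \alpha u_S)| \le 1$, and $|u_S^T x^{(i)}_S| \le \sqrt{d}\, \|u_S\|_2 = M \lambda_n d$. Moreover, we have $\frac{1}{n} \sum_{i=1}^n \big[ (x^{(i)}_S)^T y \big]^2 \le \big|\big|\big| \frac{1}{n} \sum_{i=1}^n x^{(i)}_S (x^{(i)}_S)^T \big|\big|\big|_2 \le D_{\max}$ by assumption. Combining these pieces, we obtain

$$\max_{\alpha \in [0,1]} \Big|\Big|\Big| \frac{1}{n} \sum_{i=1}^n \eta'(x^{(i)}; \theta^*_S + \alpha u_S)\, (u_S^T x^{(i)}_S)\, x^{(i)}_S (x^{(i)}_S)^T \Big|\Big|\Big|_2 \le D_{\max} M \lambda_n d \le \frac{C_{\min}}{2},$$

where the last inequality follows as long as $\lambda_n d \le \frac{C_{\min}}{2 D_{\max} M}$. We have thus shown that

$$q^* := \Lambda_{\min}\big( \nabla^2 \ell(\theta^*_S + \alpha u; \{x^{(i)}\}) \big) \ge \frac{C_{\min}}{2} \tag{54}$$

with probability converging to one, as long as $\lambda_n d$ is sufficiently small.

Finally, combining the bounds (52), (53), and (54) in the expression (51), we conclude that

$$G(u_S) \ge (\lambda_n \sqrt{d})^2 \Big\{ -\frac{1}{4} M + \frac{C_{\min}}{2} M^2 - M \Big\}.$$

This expression is strictly positive for $M = 5/C_{\min}$. Moreover, for this choice of $M$, we have that $\lambda_n d$ must be upper bounded by $\frac{C_{\min}}{2 D_{\max} M} = \frac{C_{\min}^2}{10 D_{\max}}$, as assumed in the lemma statement.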
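The final arithmetic of this proof is easy to verify. The sketch below (function names are hypothetical) evaluates the bracketed factor $-\frac{1}{4}M + \frac{C_{\min}}{2}M^2 - M$ at $M = 5/C_{\min}$ and confirms that the resulting constraint on $\lambda_n d$ matches the threshold stated in Lemma 3.

```python
def bracket_factor(M: float, c_min: float) -> float:
    """Factor multiplying (lambda_n * sqrt(d))^2 in the lower bound on G."""
    return -0.25 * M + 0.5 * c_min * M ** 2 - M

def lambda_d_threshold(c_min: float, d_max: float) -> float:
    """Upper bound on lambda_n * d implied by M = 5 / C_min."""
    M = 5.0 / c_min
    return c_min / (2.0 * d_max * M)

for c_min in (0.1, 0.5, 1.0, 2.0):
    M = 5.0 / c_min
    # At M = 5 / C_min the factor equals 6.25 / C_min, which is strictly positive.
    print(c_min, bracket_factor(M, c_min), lambda_d_threshold(c_min, d_max=3.0))
```

The factor is positive exactly when $M > 2.5 / C_{\min}$, so $M = 5/C_{\min}$ leaves a factor-of-two margin.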

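A.3 Proof of Lemma 4

We first show that the remainder term $R^n$ satisfies the bound $\|R^n\|_\infty \le D_{\max} \|\hat{\theta}_S - \theta^*_S\|_2^2$. Then the result of Lemma 3, namely that $\|\hat{\theta}_S - \theta^*_S\|_2 = O_p(\lambda_n \sqrt{d})$, can be used to conclude that $\frac{\|R^n\|_\infty}{\lambda_n} = O_p(\lambda_n d)$, which suffices to guarantee the claim of Lemma 4.

Focusing on element $R^n_j$ for some index $j \in \{1, \ldots, p\}$, we have

$$\begin{aligned}
R^n_j &= \big[ \nabla^2 \ell(\bar{\theta}^{(j)}; x) - \nabla^2 \ell(\theta^*; x) \big]_j^T\, [\hat{\theta} - \theta^*] \\
&= \frac{1}{n} \sum_{i=1}^n \big[ \eta(x^{(i)}; \bar{\theta}^{(j)}) - \eta(x^{(i)}; \theta^*) \big] \big[ x^{(i)} (x^{(i)})^T \big]_j^T\, [\hat{\theta} - \theta^*],
\end{aligned}$$

for some point $\bar{\theta}^{(j)} = t_j \hat{\theta} + (1 - t_j) \theta^*$. Setting $g(t) = \frac{4 \exp(2t)}{[1 + \exp(2t)]^2}$, note that $\eta(x; \theta) = g\big( x_r \sum_{t \in V \setminus r} \theta_{rt} x_t \big)$. By the chain rule and another application of the mean value theorem, we then have

$$\begin{aligned}
R^n_j &= \frac{1}{n} \sum_{i=1}^n g'\big( \bar{\bar{\theta}}^{(j)T} x^{(i)} \big)\, (x^{(i)})^T [\bar{\theta}^{(j)} - \theta^*]\, \big\{ x^{(i)}_j (x^{(i)})^T [\hat{\theta} - \theta^*] \big\} \\
&= \frac{1}{n} \sum_{i=1}^n \big\{ g'\big( \bar{\bar{\theta}}^{(j)T} x^{(i)} \big)\, x^{(i)}_j \big\} \big\{ [\bar{\theta}^{(j)} - \theta^*]^T x^{(i)} (x^{(i)})^T [\hat{\theta} - \theta^*] \big\},
\end{aligned}$$

where $\bar{\bar{\theta}}^{(j)}$ is another point on the line joining $\hat{\theta}$ and $\theta^*$. Setting $a_i := \{ g'( \bar{\bar{\theta}}^{(j)T} x^{(i)} )\, x^{(i)}_j \}$ and $b_i := \{ [\bar{\theta}^{(j)} - \theta^*]^T x^{(i)} (x^{(i)})^T [\hat{\theta} - \theta^*] \}$, we have

$$|R^n_j| = \frac{1}{n} \Big| \sum_{i=1}^n a_i b_i \Big| \le \frac{1}{n} \|a\|_\infty \|b\|_1.$$

A calculation shows that $\|a\|_\infty \le 1$, and

$$\begin{aligned}
\frac{1}{n} \|b\|_1 &= t_j\, [\hat{\theta} - \theta^*]^T \Big\{ \frac{1}{n} \sum_{i=1}^n x^{(i)} (x^{(i)})^T \Big\} [\hat{\theta} - \theta^*] \\
&= t_j\, [\hat{\theta}_S - \theta^*_S]^T \Big\{ \frac{1}{n} \sum_{i=1}^n x^{(i)}_S (x^{(i)}_S)^T \Big\} [\hat{\theta}_S - \theta^*_S] \\
&\le D_{\max} \|\hat{\theta}_S - \theta^*_S\|_2^2,
\end{aligned}$$

where the second line uses the fact that $\hat{\theta}_{S^c} = \theta^*_{S^c} = 0$. This concludes the proof.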
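The claim that $\|a\|_\infty \le 1$ reduces to a bound on the derivative of $g$, since each $x^{(i)}_j \in \{-1, +1\}$. The sketch below checks this numerically under the assumption that $g(t) = 4\exp(2t)/[1 + \exp(2t)]^2$, which can be rewritten as $1/\cosh^2(t)$; the function names and the evaluation grid are illustrative choices.

```python
import math

def g(t: float) -> float:
    """g(t) = 4 exp(2t) / (1 + exp(2t))^2, written equivalently as 1 / cosh(t)^2."""
    return 1.0 / math.cosh(t) ** 2

def g_prime(t: float) -> float:
    """Derivative of 1 / cosh(t)^2, namely -2 tanh(t) / cosh(t)^2."""
    return -2.0 * math.tanh(t) / math.cosh(t) ** 2

# Evaluate |g'| on a fine grid over [-10, 10].
grid = [k / 1000.0 for k in range(-10000, 10001)]
max_abs_derivative = max(abs(g_prime(t)) for t in grid)

# Analytically the maximum is attained where tanh(t)^2 = 1/3.
analytic_max = 4.0 / (3.0 * math.sqrt(3.0))
print(max_abs_derivative, analytic_max)
```

Since the maximum of $|g'|$ is strictly below 1, the bound $\|a\|_\infty \le 1$ holds with room to spare.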



B Proof of Lemma 7 

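Recall from the discussion leading up to the bound (43) that element $(j, k)$ of the matrix difference $Q^n - Q^*$, denoted by $Z_{jk}$, satisfies a sharp tail bound. By definition of the $\ell_\infty$-matrix norm, we have

$$\begin{aligned}
\mathbb{P}\big[ |||Q^n_{S^c S} - Q^*_{S^c S}|||_\infty \ge \delta \big] &= \mathbb{P}\Big[ \max_{j \in S^c} \sum_{k \in S} |Z_{jk}| \ge \delta \Big] \\
&\le (p - d)\, \mathbb{P}\Big[ \sum_{k \in S} |Z_{jk}| \ge \delta \Big],
\end{aligned}$$

where the final inequality uses a union bound, and the fact that $|S^c| \le p - d$. Via another union bound over the row elements, we have

$$\mathbb{P}\big[ |||Q^n_{S^c S} - Q^*_{S^c S}|||_\infty \ge \delta \big] \le (p - d)\, d\, \mathbb{P}\big[ |Z_{jk}| \ge \delta/d \big],$$

from which the claim (46a) follows by setting $\epsilon = \delta/d$ in the Hoeffding bound (43). The proof of bound (46b) is analogous, with the pre-factor $(p - d)$ replaced by $d$.

To prove the last claim (46c), we write

$$\begin{aligned}
|||(Q^n_{SS})^{-1} - (Q^*_{SS})^{-1}|||_\infty &= |||(Q^*_{SS})^{-1} [Q^*_{SS} - Q^n_{SS}] (Q^n_{SS})^{-1}|||_\infty \\
&\le \sqrt{d}\, |||(Q^*_{SS})^{-1} [Q^*_{SS} - Q^n_{SS}] (Q^n_{SS})^{-1}|||_2 \\
&\le \sqrt{d}\, |||(Q^*_{SS})^{-1}|||_2\, |||Q^*_{SS} - Q^n_{SS}|||_2\, |||(Q^n_{SS})^{-1}|||_2 \\
&\le \frac{\sqrt{d}}{C_{\min}}\, |||Q^*_{SS} - Q^n_{SS}|||_2\, |||(Q^n_{SS})^{-1}|||_2.
\end{aligned}$$

From the proof of Lemma 5, in particular equation (44), we have

$$\mathbb{P}\Big[ |||(Q^n_{SS})^{-1}|||_2 \ge \frac{2}{C_{\min}} \Big] \le 2 \exp\Big( -K \frac{n}{d^2} + B \log(d) \Big)$$

for a constant $B$. Moreover, from equation (44), we have

$$\mathbb{P}\big[ |||Q^n_{SS} - Q^*_{SS}|||_2 \ge \delta/\sqrt{d} \big] \le 2 \exp\Big( -K \frac{\delta^2 n}{d^3} + 2 \log(d) \Big),$$

so that the bound (46c) follows.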
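The union-bound bookkeeping behind (46a) and (46b) can be made concrete by plugging the entrywise Hoeffding tail into the two prefactors. The sketch below uses hypothetical function names and the explicit entrywise bound $2\exp(-\epsilon^2 n/32)$, which follows from Hoeffding's inequality for variables bounded by 4 in absolute value; the numerical inputs are illustrative.

```python
import math

def entry_tail(n: int, eps: float) -> float:
    """Hoeffding tail 2 * exp(-eps^2 * n / 32) for one entry Z_jk with |Z| <= 4."""
    return 2.0 * math.exp(-eps ** 2 * n / 32.0)

def bound_46a(n: int, p: int, d: int, delta: float) -> float:
    """Union bound over (p - d) rows and d columns, with eps = delta / d."""
    return (p - d) * d * entry_tail(n, delta / d)

def bound_46b(n: int, d: int, delta: float) -> float:
    """Same argument with the row prefactor (p - d) replaced by d."""
    return d * d * entry_tail(n, delta / d)

print(bound_46a(n=100_000, p=64, d=4, delta=0.5))
print(bound_46b(n=100_000, d=4, delta=0.5))
```

In exponent form these are $2\exp(-\delta^2 n/(32 d^2) + \log(d) + \log(p - d))$ and $2\exp(-\delta^2 n/(32 d^2) + 2\log(d))$, matching the shape of the bounds in Lemma 7.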
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